Abstract

The purpose of this study is to examine the number of differential item functioning (DIF) items detected by hierarchical generalized linear modeling (HGLM) at different sample sizes. Eight data files of different sizes were composed. The population of the study is the 798307 students who took the 2006 OKS Examination, from whom 10727 students were selected by random sampling as the sample. The Turkish, science, and social studies subtests of the OKS-2006, each composed of 25 items, were used as data-gathering instruments. It was concluded that variation in sample size has a great effect on DIF detection in test items.

In the social sciences, obtaining accurate, highly reliable, and acceptable measurements and interpretations is both difficult and very important, partly because of the nature of the variables examined. Because the social sciences deal with human beings, whose nature is highly complex, the available measurement techniques are sometimes insufficient. In the physical sciences, where direct measures are available, it is much easier to determine the direction and magnitude of the systematic and fixed errors that affect measurement results. In the social sciences, however, this is not easy, because the measurements are commonly indirect. In educational studies, psychological constructs such as achievement, ability, and personality are often measured. It is important to answer two questions: how to measure the psychological constructs of individuals, and what decisions to make on the basis of the measurement results. Because these two questions are so critical, the size of the systematic and fixed errors affecting measurement results becomes more important for the validity of measurement instruments and results.

Along with the validity of the test items and measurement instruments used in education, the validity of measurement is one of the main concerns in the study of bias. One of the main objectives of measurement in education is to obtain information about individuals and test items. Highly valid and accurate measurement instruments and results are needed to achieve this objective. However, biased items are one of the factors that have a negative effect on validity. The existence of biased items in a test decreases the reliability of the interpretations made.

Item bias is said to be a result of “systematic errors” that affect measurement results. By the definition of systematic error, it does not affect all results equally. The existence of items that include systematic errors is a problem strongly related to the validity of the test. In validity analysis, it is therefore important to detect biased items among the test items. This involves the detection of “Differential Item Functioning,” which can be determined by statistical methods.

In recent studies, differential item functioning (DIF) typically refers to item bias (Ellis & Raju, 2003). In the late 1980s, the term “DIF” replaced the term “item bias.” DIF reveals differences between subgroups in the probability of answering an item correctly at every ability level of the psychological construct that the item is intended to measure (Embretson & Reise, 2000; Lord, 1980). Studies of DIF require comparing the performance on test items of groups that are at the same ability level but have different demographic characteristics, such as male-female or Asian-European (Greer, 2004).

When DIF exists in test items, it may be caused by real differences between the subgroups (item impact) or by item bias (Zumbo, 1999). There are many methods for DIF detection. Some of these methods are based on classical test theory; Mantel-Haenszel (M-H), logistic regression (LR), and SIBTEST are examples (Gierl, Khaliq, & Boughton, 1999). Others, such as Lord’s chi-square test, Raju’s area measures, and the likelihood ratio test, are based on item response theory (Ogretmen, 1995; Zwick, Donoghue, & Grima, 1993). Most of these methods provide similar information about DIF. The literature contains many DIF detection studies that use the M-H technique (Allalouf, 2003; Duncan, 2006; Gondal, 2001; Hamzeh & Johanson, 2003; Ogretmen, 2006; Randall, 2001; Yildirim, 2006; Yurdugul, 2003). With the development of later methods, the LR method and the likelihood ratio method based on item response theory gained importance relative to the M-H method in DIF detection studies. However, educational research data have been found to have a hierarchical structure. As a result, the HGLM method has become prominent in DIF detection studies (Chaimongkol, Huffer, & Kamata, 2007; Kamata, Chaimongkol, Genç, & Bilir, 2005; Luppescu, 2002; Vaughn, 2006; Williams, 2003). The HGLM, M-H, and logistic regression methods are similar to each other in that they are based on observed scores (Binici, 2007). This study focuses on the HGLM method.
HGLM is a method that derives linear equations explaining the characteristics of individuals and of group members as a function of the groups to which they belong. To detect whether students’ characteristics affect the probability of answering test items correctly, which is what a DIF study of an item examines, predictor variables for those characteristics are added to the level-2 model. In HGLM, a level-1 (item-level) model and a level-2 (individual-level) model are specified for item scores (the outcome) that have two categories (Kamata, 2002).


Purpose of the Study 

The purpose of this study is to examine the number of DIF items detected by HGLM at different sample sizes. Because HGLM is a new method, it is important to examine the effect of sample size on DIF in tests that measure different skills.


Method 

This study is descriptive research that examines whether the DIF results determined by the HGLM method vary with sample size.


Sample 

The population of the study is the 798307 students who took the 2006 OKS Examination. Of these, 10727 students were selected as the sample by random sampling.


Instrument 

The Turkish, science, and social studies subtests of the OKS-2006, each composed of 25 items, were used as data-gathering instruments in this study.


Data Analysis 

Because DIF was examined with respect to gender, the subgroups were formed by gender. Female students were chosen as the focal group and male students as the reference group. The HLM-6.04 program (Raudenbush, Bryk, Cheong, & Congdon, 2001) was used for DIF detection by HGLM. In HGLM, the level-1 and level-2 equations are established as follows to determine DIF with conditional modeling (Kamata, 2002):


Level-1 Equation (Item Level): Here i (i = 1, 2, ..., k) indexes the items and j (j = 1, 2, ..., N) indexes the individuals.

$$\eta_{ij} = \beta_{0j} + \beta_{1j}X_{1ij} + \beta_{2j}X_{2ij} + \cdots + \beta_{(k-1)j}X_{(k-1)ij}$$

$\eta_{ij}$: The estimated outcome variable, which expresses on the logit scale the probability of individual j giving the correct answer to item i.

$X_{qij}$: The indicator variable for item q. When the answer given is to item i (q = i), the value is 1; otherwise (q ≠ i), the value is 0.

$\beta_{0j}$: The intercept. When all $X_{qij}$ are 0, what remains is the effect of the item that has no indicator in the model. Hence, $\beta_{0j}$ is the effect of the item that is not included in the model (the reference item).

$\beta_{ij}$: The effect of item i (i = 1, 2, ..., k-1) on the outcome variable, that is, on the probability of individual j giving the correct answer. The parameters $\beta_{1j}$ to $\beta_{(k-1)j}$ are coefficients that show the effects of items 1 to k-1 on the probability of a correct answer for the individual. The subscript j indicates that the item-level parameters may differ across individuals. At the higher level, the subscript j is dropped from $\beta_{ij}$, and the item parameters are kept constant across individuals.
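
The indicator coding described above can be sketched in a few lines of Python. This is only an illustration of the coding scheme, not the HLM-6.04 setup used in the study; the function name `level1_indicators` and the choice of the last item as the reference item are assumptions made here.

```python
def level1_indicators(k):
    """Build the k x (k-1) indicator matrix for one individual.

    Row i (items 1..k) holds X_1ij ... X_(k-1)ij for the response to item i:
    the entry is 1 when q == i and 0 otherwise. The last item is assumed to be
    the reference item, so its row is all zeros and its effect is carried by
    the intercept beta_0j.
    """
    return [[1 if q == i else 0 for q in range(1, k)] for i in range(1, k + 1)]


X = level1_indicators(25)  # 25 items per subtest, as in the OKS-2006 subtests
print(len(X), len(X[0]))   # number of rows and columns
print(X[0][:3], sum(X[24]))  # first row starts with a 1; reference row sums to 0
```

Each non-reference item switches on exactly one indicator, which is why only k-1 item coefficients appear in the level-1 equation.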

Level 2 is employed to determine the differences 
between the probabilities of answering each item 
correctly according to the genders of the students. 

Level-2 (Student Level) Equation:

$$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Gender})_j + u_{0j}$$

$$\beta_{1j} = \gamma_{10} + \gamma_{11}(\text{Gender})_j$$

$$\vdots$$

$$\beta_{(k-1)j} = \gamma_{(k-1)0} + \gamma_{(k-1)1}(\text{Gender})_j$$

$\beta_{ij}$: The effect of item i (i = 1, 2, ..., k-1) on the probability of individual j giving the correct answer. The parameters $\beta_{1j}$ to $\beta_{(k-1)j}$ are the effects of items 1 to k-1 on the probability of a correct answer for individual j.

$\gamma_{00}$: The parameter of the reference item.

$\gamma_{01}$: The difference between male and female students in the probability of giving the correct answer to the related item. In other words, it is the effect of the gender variable on the probability of giving the correct answer to item i.

$u_{0j}$: The random effect in $\beta_{0j}$, which is normally distributed with a mean of 0 and a variance of $\tau$.
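
Substituting the level-2 equations into the level-1 equation gives a single combined model, which may help show where the gender terms enter. The derivation below is a sketch based only on the equations as written above. The summation index q is notation introduced here, and the reference item is assumed to be the one whose indicators are all zero.

```latex
\begin{aligned}
\eta_{ij} &= \gamma_{00} + \gamma_{01}(\text{Gender})_j + u_{0j}
  + \sum_{q=1}^{k-1}\bigl[\gamma_{q0} + \gamma_{q1}(\text{Gender})_j\bigr]X_{qij} \\
\text{item } i \le k-1:\quad
\eta_{ij} &= (\gamma_{00} + \gamma_{i0}) + (\gamma_{01} + \gamma_{i1})(\text{Gender})_j + u_{0j} \\
\text{reference item}:\quad
\eta_{ij} &= \gamma_{00} + \gamma_{01}(\text{Gender})_j + u_{0j}
\end{aligned}
```

Read this way, $\gamma_{i1}$ is the gender term specific to item i relative to the reference item, which is the quantity a significance test for DIF on item i would examine. This reading is inferred from the algebra and is not stated explicitly in the source.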

Because the purpose of this study is to examine how the number of DIF items obtained by HGLM varies with sample size, eight data files of different sizes were composed. The sample sizes were defined as 1%, 2%, 5%, 10%, 25%, 50%, 75%, and 100% of the 10727 students. The numbers of observations in the eight samples are shown in Table 1.
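
As a quick arithmetic check, the nominal size implied by each sampling rate can be compared with the number of observations reported in Table 1. The sketch below uses only the rates and counts given in this paper; the variable names are chosen here for illustration.

```python
N_FULL = 10727
rates_pct = [1, 2, 5, 10, 25, 50, 75, 100]
# Observation counts as reported in Table 1
reported = [97, 207, 532, 1055, 2681, 5320, 8037, 10727]

# Nominal size = rate * N_FULL, rounded to the nearest integer
# (integer arithmetic avoids floating-point rounding surprises)
nominal = [(N_FULL * r + 50) // 100 for r in rates_pct]

for r, nom, rep in zip(rates_pct, nominal, reported):
    print(f"{r:>3}%  nominal={nom:>5}  reported={rep:>5}  diff={rep - nom}")
```

The reported counts are equal to or slightly below the nominal ones. The paper does not describe how each subsample was drawn, so the gap is left unexplained here.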

DIF analysis by HGLM was carried out on the different samples, whose numbers of observations ranged from 97 to 10727. Examination of the reliability coefficients of the estimates showed that, especially for the Turkish and social studies subtests, reliability was sufficient even in the smaller samples.

Results and Discussion 

DIF analyses by the HGLM method with respect to gender were carried out separately for the Turkish, science, and social studies subtests in the eight samples of different sizes. The numbers of DIF items obtained by the HGLM method at the different sample sizes are given in Table 2.

Table 1.

Sample Sizes and Reliability of Estimations

                                 Reliability of Estimations
Sampling rate   Sample size   Turkish   Science   Social Studies
1%                       97     0.850     0.661            0.820
2%                      207     0.829     0.728            0.828
5%                      532     0.810     0.733            0.832
10%                    1055     0.815     0.762            0.837
25%                    2681     0.815     0.754            0.840
50%                    5320     0.815     0.750            0.839
75%                    8037     0.816     0.751            0.838
100%                  10727     0.818     0.752            0.840

Table 2.

Number of Observed DIF Items Related to Sample Sizes

                   Turkish          Science       Social Studies
Sample size    <0.05   <0.01    <0.05   <0.01    <0.05   <0.01
97                 2       1        1       0        0       0
207                2       1        1       1        0       0
532                0       0        2       0        0       0
1055               0       0        3       2        3       0
2681              11       5        4       1        6       3
5320              11       6        7       6        9       7
8037              12      11       10       7       13      10
10727             12      12       10       8       15      13

In the detection of DIF items, two significance levels were considered: 0.05 and 0.01. When the sampling rate is 25% (n=2681), a remarkable difference in the number of DIF items is observed between the two significance levels. As the number of individuals in the samples increased, the number of DIF items also increased. The number of DIF items obtained in all subtests at the 99% confidence level was nearly half the number obtained at the 95% confidence level. It was concluded that as the confidence level increases, the number of DIF items decreases in all subtests and at the different sample sizes. Another finding of this study is that variation in sample size has a great effect on DIF detection in test items. Vaughn (2006) applied DIF analysis by the HGLM method to polytomous items in very small samples and determined that the number of estimated DIF items was lower than in bigger samples.
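
The comparison between the two significance levels can be checked with simple arithmetic on the Table 2 counts. The sketch below only re-adds the published numbers; the dictionary layout and the name `totals` are assumptions made for illustration.

```python
# Table 2 counts: sample size -> (Turkish, Science, Social Studies),
# each given as (items flagged at p < 0.05, items flagged at p < 0.01)
table2 = {
    97:    ((2, 1),   (1, 0),  (0, 0)),
    207:   ((2, 1),   (1, 1),  (0, 0)),
    532:   ((0, 0),   (2, 0),  (0, 0)),
    1055:  ((0, 0),   (3, 2),  (3, 0)),
    2681:  ((11, 5),  (4, 1),  (6, 3)),
    5320:  ((11, 6),  (7, 6),  (9, 7)),
    8037:  ((12, 11), (10, 7), (13, 10)),
    10727: ((12, 12), (10, 8), (15, 13)),
}

totals = {}
for n, cells in table2.items():
    at_05 = sum(c[0] for c in cells)  # all subtests combined, p < 0.05
    at_01 = sum(c[1] for c in cells)  # all subtests combined, p < 0.01
    totals[n] = (at_05, at_01)
    ratio = at_01 / at_05 if at_05 else float("nan")
    print(f"n={n:>5}  p<0.05: {at_05:>2}  p<0.01: {at_01:>2}  ratio={ratio:.2f}")
```

With these counts, the combined totals are 21 and 9 at n = 2681 and 37 and 33 at n = 10727, so the ratio between the two levels is closest to one half at the 25% sample.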

Miller and Spray (1993), who worked with a 27-item polytomously scored mathematics test, indicated that sample size has a great effect on the detection of DIF items, especially when a method based on the likelihood ratio is used. Because the HGLM method is based on the probability of answering items correctly, the result that Miller and Spray obtained also holds in this study.
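
One way to see why sample size matters so much for significance-based DIF detection is a back-of-the-envelope power calculation. The sketch below is deliberately simplified and is not the HGLM analysis: it treats a single item, two equally sized gender groups, and a Wald test on the log-odds difference. The effect size of 0.2 logits and the baseline probability of 0.5 are hypothetical values chosen only for illustration.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_power(n, dif_logit=0.2, p=0.5, z_crit=1.96):
    """Approximate two-sided Wald-test power for one item's gender effect.

    Assumes n examinees split equally between two groups and a correct-answer
    probability near p in both groups, so the standard error of the log-odds
    difference is sqrt(4 / (n * p * (1 - p))).
    """
    se = math.sqrt(4.0 / (n * p * (1.0 - p)))
    z = dif_logit / se
    return normal_cdf(z - z_crit) + normal_cdf(-z - z_crit)


for n in [97, 207, 532, 1055, 2681, 5320, 8037, 10727]:
    print(f"n={n:>5}  approximate power={approx_power(n):.3f}")
```

Under these assumptions the standard error shrinks in proportion to the square root of n, so the same small effect that is rarely flagged in the smallest sample is almost always flagged in the full sample.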

In the subtests, which measure different abilities, different numbers of DIF items were obtained by HGLM. Various studies have shown that the presence of multidimensionality may cause DIF (Snow & Oshima, 2009). The unidimensionality of the tests used in this study was examined, and large numbers of gender-related DIF items were noted even though the tests were unidimensional.

According to the simulation studies of Roussos and Stout (1996), no apparent differences were seen in small samples between the DIF detection results obtained by the M-H and SIBTEST methods. French and Miller (1996) applied DIF analysis using M-H and logistic regression to samples that they characterized as small (n=500) and large (n=2000), and determined that the logistic regression method achieves more accurate results at larger sample sizes. Structurally, DIF detection by HGLM and by logistic regression are similar. Hence, it can be said that HGLM is a powerful method for DIF detection studies. A DIF detection study by the HGLM method, conducted on data from a mathematics test composed of 39 multiple-choice items, emphasized that good estimates can be obtained even with larger sample sizes (Binici, 2007). Luppescu (2002) found that the results obtained by the Rasch method and the HGLM method are similar when the proportion of individuals in the focal group and the sample size are small. Kamata (2001) formally demonstrated the equivalence of the Rasch method and the HGLM method.

Recommendations 

In light of the results of this study and the literature, the following recommendations can be made:

1- The ratio of focal-group to reference-group members considered in DIF detection analysis can be examined, and the results can be discussed.

2- DIF items detected by the HGLM method can be examined in tests that measure different learning fields.

3- The cause of the DIF items detected by the HGLM method (item bias or item impact) can be determined with the opinions of experts.

4- Whether the number of DIF items detected by the HGLM method varies with test length can be investigated.